String Reconstruction from Substring Compositions
Abstract
Motivated by mass-spectrometry protein sequencing, we consider the problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime can be reconstructed uniquely up to reversal. For all other lengths, we show that unique reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments, while the upper bounds follow from algebraic approaches that lead to precise characterizations of the sets of strings with the same substring compositions in terms of the factorization properties of bivariate polynomials. Using results on the transience of multidimensional random walks, we also provide a reconstruction algorithm that recovers random strings over alphabets of size ≥ 4 from their substring compositions in optimal near-quadratic time. The problem considered is related to the well-known turnpike problem, and its solution may hence shed light on this longstanding open problem as well.
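To make the object in the abstract concrete, here is a minimal Python sketch of the composition multiset of a string, together with a brute-force grouping of short strings by that multiset. The function names `composition_multiset` and `confusable_classes` are illustrative assumptions, not names from the paper, and the exhaustive search is only a sanity check for tiny lengths; it is not the algebraic method or the near-quadratic reconstruction algorithm the abstract refers to.

```python
from collections import Counter
from itertools import product


def composition_multiset(s):
    """Return the multiset of compositions of all contiguous substrings of s.

    A composition records how many times each symbol occurs in a substring,
    ignoring order. It is stored as a sorted tuple of (symbol, count) pairs
    so that it can be used as a Counter key.
    """
    bag = Counter()
    for i in range(len(s)):
        counts = Counter()
        for j in range(i, len(s)):
            counts[s[j]] += 1  # extend the substring s[i..j] by one symbol
            bag[tuple(sorted(counts.items()))] += 1
    return bag


def confusable_classes(n, alphabet="01"):
    """Group all strings of length n over `alphabet` by composition multiset.

    Brute force (exponential in n), intended only for very small n.
    """
    classes = {}
    for chars in product(alphabet, repeat=n):
        s = "".join(chars)
        key = frozenset(composition_multiset(s).items())
        classes.setdefault(key, set()).add(s)
    return list(classes.values())


if __name__ == "__main__":
    # A string and its reversal always share the same composition multiset.
    print(composition_multiset("abca") == composition_multiset("acba"))
    # Size of the largest class of binary strings of length 7.
    print(max(len(c) for c in confusable_classes(7)))
```

A string and its reversal always land in the same class, which is why the abstract speaks of uniqueness "up to reversal". Running `confusable_classes` for other small lengths is one way to explore where classes larger than a string and its reversal first appear.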
Similar resources
Tight Bounds for String Reconstruction Using Substring Queries
We resolve two open problems presented in [8]. First, we consider the problem of reconstructing an unknown string T over a fixed alphabet using queries of the form "does the string S appear in T?" for some query string S. We show that every non-adaptive algorithm must make Ω(εn) queries in order to reconstruct a 1 − ε fraction of the strings of length n. The second problem is reconstructing a s...
Selecting the Shortest Superstring in DNA Using the Particle Swarm Algorithm
A DNA string can be regarded as a very long string over a 4-letter alphabet. Many scientists are working to decode this string. Because the string is very long, shorter sections of it that overlap one another are decoded instead. There is no information about the correct position of these sections on the main DNA string. It seems that the shortest string (substring of the main DNA string) i...
Consensus Patterns parameterized by input string length is W[1]-hard
Where Ham() denotes the Hamming distance and S[a..b] is the substring of S starting at a and ending at b. This problem is one of many variations of the well-studied Consensus String problem. It is similar to Consensus Substring in that the target string must be close to a substring of each input string (rather than the whole string). However, in the latter problem the distance to each input stri...
The (non-)existence of perfect codes in Lucas cubes
A Fibonacci string of length $n$ is a binary string $b = b_1b_2\ldots b_n$ in which for every $1 \leq i < n$, $b_i\cdot b_{i+1} = 0$. In other words, a Fibonacci string is a binary string without 11 as a substring. Similarly, a Lucas string is a Fibonacci string $b_1b_2\ldots b_n$ such that $b_1\cdot b_n = 0$. For a natural number $n\geq1$, a Fibonacci cube of dimension $n$ is denoted by $\Gamma_n$ and i...
Accelerating Substring Searching: Breaking the I/O Barrier
The exponential increase in the size of string databases makes substring search a challenging problem. Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets. We accelerate string search tools and reduce their memory requirements by precomputing the associations between the database strings and the query string. Our ...
Journal title:
- SIAM J. Discrete Math.
Volume 29, Issue
Pages -
Publication date: 2015